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^ | ' Let X e R nxd be a data matrix of rank p, representing n points in R d . The 

linear support vector machine constructs a hyperplane separator that maximizes the 
1-norm soft margin. We develop a new oblivious dimension reduction technique which 
is precomputed and can be applied to any input matrix X. We prove that, with high 
probability, the margin and minimum enclosing ball in the feature space are preserved 
to within e-relative error, ensuring comparable generalization as in the original space. 
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We present extensive experiments with real and synthetic data to support our theory. 
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^ ! 1 Introduction 

>■ 

The Support Vector Machine (SVM) (Christianini & Shawe- Taylor, 2000 [1] ) is a popular 
classifier in machine learning today. The training data set consists of n points Xj G M. d , 
with respective labels j/j G {— 1,+1} for i = l...n. For linearly separable data, the 
primal form of the SVM learning problem is to construct a hyperplane w* which maximizes 
the geometric margin (the minimum distance of a data point to the hyperplane), while 
separating the data. For non-separable data the "soft" 1-norm margin is maximized. The 
dual lagrangian formulation of the problem leads to the following quadratic program: 
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X 

5_i , , 

a \ xuii) i=i * ij=i 

subject to y yiOti = 0, 
i=l 

< on < C, i = 1 . . . n. 

In the above formulation, the unknown lagrange multipliers {ai}" =1 are constrained to lie 
inside the "box constraint" [0,C] n , where C is part of the input. In order to measure the 
out-of-sample performance of the SVM, we can use the VC-dimension of /ai-separators. 
Assuming that the data lie in a ball of radius B, and that the hypothesis set consists 
of hyperplanes of width 7 (corresponding to the margin), then the 1/C-dimension of this 



"Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA, pauls2@rpi.edu 
f Mathematical Sciences Department, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, 
cboutsi@us . ibm. com 

^Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA magdon@cs.rpi.edu 
^Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA, drinep@cs.rpi.edu 



1 



hypothesis set is 0(B 2 /-/ 2 ) (Vapnik,1998 |2j). Now, given the in-sample error, we can obtain 
a bound for the out-of-sample error, which is monotonic in the VC-dimension (Vapnik & 
Chervonenkis, 1971 [3]). 

The main intuition behind our work is that if we can preserve a subspace geometry, 
then we should be able to preserve the performance of a distance based algorithm. We 
construct dimension reduction matrices R G M. dxr which produce r-dimensional feature 
vectors 5q = R T Xjj the matrices R do not depend on the data. We show that for the data 
in the dimension-reduced space, the margin of separability and the minimum enclosing 
ball radius are preserved, since the subspace geometry is preserved. So, an SVM with 
an appropriate structure defined by the margin (width) of the hyperplanes (Vapnik & 
Chervonenkis, 1971 [3]) will have comparable VC-dimension and, thus, generalization error. 
To state our results precisely, we first need some SVM basics. 

1.1 Notation and SVM Basics 

A,B,... denote matrices and a,,... denote column vectors; (for all i = l...n) is 
the standard basis, whose dimensionality will be clear from context; and I n is the n x n 
identity matrix. The Singular Value Decomposition (SVD) of a matrix A G M. nxd of 
rank p < min{n, d} is equal to A = US1V T , where U G M. nxp is an orthogonal matrix 
containing the left singular vectors, S G M pxp is a diagonal matrix containing the singular 
values 0"i > 0"2 > . . . dp > 0, and V G M. dxp is a matrix containing the right singular vectors. 
The spectral norm of A is ||A|| 2 = o\. 

We introduce matrix notation that we will use for the remainder of the paper. Let 
X G W ixd be the matrix whose rows are the vectors x?\ Y G W ixn be the diagonal matrix 
with entries Yu = yi, and a = [a\, 02, • • • , a n ] G M. n be the vector of lagrange multipliers 
to be determined by solving eqn. eqmsvml. The SVM optimization problem is 

max l T a - -« T YXX T Yc* 

2 (2) 

subject to l T Ya = 0; and < a < C. 

(In the above, 1, 0, C are vectors with the implied constant entry.) Let a* be an 
optimal solution of the above problem. The optimal separating hyperplane is given by 
w* = X r Ya* = X^ILi y* a i x «> an d the points Xj for which a* > 0, i.e., the points which 
appear in the expansion w*, are the support vectors. The geometric margin, 7*, of this 
canonical optimal hyperplane is 7* = 1/ ||w*|| 2 , where 1 1 w* 1 1 2 = Y17=i a 1- ^ ne data radius 
is B = min x * max Xi ||xj — x*|L. It is this 7* and B that factor into the generalization 
performance of the SVM through the ratio B/'j*. 

1.2 Dimension Reduction 

Our goal is to study how the SVM performs under (linear) dimensionality reduction trans- 
formations in the feature space. Let R G M. dxr be the dimension reduction matrix that 
reduces the dimensionality of the input from d to r <C d. We will choose R to be a random 
projection matrix (see Section [2]). The transformed dataset into r dimensions is given by 
X = XR, and the SVM optimization problem becomes 

maxl T d! - -Q T YXRR T X T Yd, 

a 2 (3) 

subject to 1 t Y6l = 0, and < 6l < C 
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Solving the dimensionally-reduced problem above is computationally more efficient than 
solving the original, cf-dimensional problem. We will present a construction for R that 
leverages the fast Hadamard transform. The running time needed to apply this construction 
to the original data matrix is O (ndlogr). Notice that while this running time is nearly 
linear on the size of the original data, it does not take advantage of any sparsity in the 
input. In order to address this deficiency, we leverage the recent work of Clarkson & 
Woodruff, (2012) [4J, which proposes a construction for R that can be applied to X in 
O (nnz(X) log (/?)) time; here nnz (X) denotes the number of non-zero entries of X and p is 
the rank of X. To the best of our knowledge, this is the first independent implementation 
and evaluation of this potentially ground-breaking random projection technique (a few 
experimental results were presented by Clarkson & Woodruff, (2012) [4]. All constructions 
for R are oblivious of the data and hence they can be precomputed. Also, all generalization 
bounds that depend on the final margin and radius of the data will continue to hold. 

1.3 Our Contribution 

Let p be the rank of X. In the transformed space, let the resulting margin after solving the 
optimization problem be 7*, and assume that the projected data have data radius B. Our 
main theoretical result is to show that, for suitably chosen values of r, both the margin and 
the data radius are preserved to relative error: 

7* 2 > (l-e) 7 * 2 ; B 2 <(l + e)B 2 . 

Thus, it is possible to obliviously reduce the dimension of the data while preserving the good 
generalization properties of the SVM. We briefly discuss the appropriate values of r: if R is 
the randomized Hadamard transform, we need to set r = O (pe~ 2 log 2 (dpe~ 2 )) ; if R is con- 
structed as described in Clarkson & Woodruff, (2012) [3], then r = O (pe~ A log (p/e) (p + log (1/e))) . 
The running time needed to apply the former transform on X is O (nd log p) ; the running 
time needed to apply the latter transform is O (nnz(X) log p). 

1.4 Prior work 

The work most closely related to our results is that of Krishnan et al., (2008) [5], which 
improved upon Balcazar et al., (2001) [6]. Krishnan et al., [5] showed that by using sub- 
problems based on Gaussian random projections, one can obtain a solution to the SVM 
problem with a margin that is relative-error close to the optimal. Their sampling complexity 
(the parameter r in our parlance) depends on S 4 , and, most importantly, on I/7* 2 . This 
bound is not directly comparable to our result, which only depends on the rank of the 
data manifold, and holds regardless of the margin of the original problem (which could be 
arbitrarily small). Our results dramatically improve the running time needed to apply the 
random projections; our running times are (theoretically) linear in the number of non-zero 
entries in X, whereas (Krishnan et al., (2008) [5]) necessitates 0(ndr) time to apply R on 
X. 

Shi et al., (2012) [7] establish the conditions under which margins are preserved after 
random projection and show that error free margins are preserved for both binary and multi- 
class problems if these conditions are met. They discuss the theory of margin and angle 
preservation after random projections using Gaussian matrices. They show that margin 
preservation is closely related to acute angle preservation and inner product preservation. 
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Smaller acute angle leads to better preservation of the angle and the inner product. When 
the angle is well preserved, the margin is well-preserved too. There are two main differences 
between their result and ours. They show margin preservation to within additive error, 
whereas we give margin preservation to within relative error. This is a big difference 
especially when the margin is small. Moreover, they analyze only the separable case. We 
analyze the general non-separable dual problem and give a result in terms of the norm of the 
weight vector. For the separable case, the norm of the weight vector directly relates to the 
margin. For the non-separable case, one has to analyze the actual quadratic program, and 
our result essentially claims that the solution in the transformed space will have comparably 
regularized weights as the solution in the original space. 

Shi et al., (2009) [8] used hash kernels which approximately preserved inner product to 
design a biased approximation of the kernel matrix. The hash kernels can be computed in 
the number of non-zero terms of a data matrix like the method of Clarkson & Woodruff 
[I] used in our case. Shi et al., (2009) [8] used random sign matrix to compute random 
projections which increased the number of non-zero terms of the data matrix. However, 
the method of Clarkson & Woodruff [4] takes advantage of input sparsity. Shi et al., (2009) 
[8] showed that their generalization bounds on the hash kernel and the original kernel 
differed by the inverse of the product of the margin and number of datapoints. For smaller 
margins, this difference will be high. Our generalization bounds are independent of the 
original margin and hold for arbitrarily small margins. 

Finally, it is worth noting that random projection techniques have been applied ex- 
tensively in the compressed sensing literature, and our theorems have the same flavor to 
a number of results in that area. However, to the best of our knowledge, the compressed 
sensing literature has not investigated the 1-norm soft-margin SVM optimization problem. 

2 Random Projection Matrices 

Random projections are extremely popular techniques in order to deal with the curse-of- 
dimensionality. Let the data matrix be X G M. nxd (n data points in M. d ) and let R G M dxr 
(with r < (I) be a random projection matrix. Then, the projected data matrix is X = 
XR G R nxr (n points in R r ). If R is carefully chosen, then all pairwise Euclidean distances 
are preserved with high probability. Thus, the geometry of the set of points in preserved, 
and it is reasonable to hope that an optimization objective such as the one that appears in 
SVMs will be only mildly perturbed. 

There are many possible constructions for the matrix R that preserve pairwise distances. 
The most common one is a matrix R whose entries are i.i.d. standard Gaussian random 
variables (Indyk & Motwani, 1998 [9j ; Dasgupta & Gupta, 2003 [TO]). Achlioptas (2003) 
[llj argued that the random sign matrix - RS for short - e.g., a matrix whose entries are 
set to +1 or —1 with equal probability, also works. These constructions take O (ndr) time 
to compute X. 

More recently, faster methods of constructing random projections have been developed, 
using, for example, the Fast Hadamard Transform (Ailon & Chazelle, 2006 [12]) - FHT 
for short. The Hadamard- Walsh matrix for any d that is a power of two is defined as 



H, 



■d/2 



G M' 



dxd 
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with Hi = +1. The normalized Hadamard- Walsh matrix is yl ^H^, which we simply denote 
by H. We set: _ 

R-srht = ^ DHS, (4) 

ndxd 



a rescaled product of three matrices. D G M. is a random diagonal matrix with D 

l 

2" 



equal to ±1 with probability i. H G M dxd is the normalized Hadamard transform matrix. 



S G M dxr is a random sampling matrix which randomly samples columns of DH; specif- 
ically, each of the r columns of S is independent and selected uniformly at random (with 
replacement) from the columns of Lj, the identity matrix. This construction assumes that 
d is a power of two. If not, we just pad X with columns of zeros (affecting run times by at 
most a factor of two). The important property of this transform is that the projected fea- 
tures X = XR can be computed efficiently in O (ndlnr) time (see Theorem 2.1 of (Ailon & 
Liberty, 2008 |13] ) for details). An important property of R (that follows from prior work) 
is that it preserves orthogonality. 

While the randomized Hadamard transform is a major improvement over prior work, 
it does not take advantage of any sparsity in the input matrix. To fix this, very recent 
work (Clarkson & Woodruff, 2012 [5j) shows that carefully constructed random projection 
matrices can be applied in input sparsity time by making use of generalized sparse embed- 
ding matrices. To understand their construction of R, assume that the rank of X is p and 
let r = O (pe- A log(p/e) (p + log (1/e))) . Then, let k = G (e~ 2 log (r/e)), let v = 8(e -1 ), 
and let q = r/k be an integer (by appropriately choosing the constants). The construction 
starts by letting h : 1 . . . d — ^ 1 ... g be a random hash function; then, for i = 1 ... q, let 
a,i = \h~ l (i)\ and let d = J2i=i a i- The construction proceed by creating q independent 
matrices Bi . . . B g , such that Bj G R fcxai . Each Bj is the concatenation (stacking the rows 
of matrices on top of each other) of the following matrices: y^^iDi . . . y^^^D^. The 
matrix 3?iDj G R aiX,) is defined as follows: for each m G {1 . . . a^}, h(m) = g' , where g' is 
selected from {1 . . . v} uniformly at random. 3>i is a v x a% binary matrix with 3?h(m),m = 1 
and all remaining entries set to zero. D is an a% x a% random diagonal matrix, with each 
diagonal entry independently set to be +1 or —1 with probability 1/2. Finally, let S be the 
block diagonal matrix constructed by stacking the Bj's across its diagonal and let P be a 
d x d permutation matrix; then, R = (SP) . We will call the method of Clarkson & 
Woodruff, (2012) [4J to construct a sparse embedding matrix CW. 



3 Geometry of SVM is preserved under Random Projection 

We now state and prove our main result, namely that solving the SVM optimization problem 
in the projected space results in comparable margin and data radius as in the original space. 
These results are dependent on the main technical result from numerical linear algebra 
literature which we state in the lemma below. 

Lemma 1. Fix e G (0, |], 5 G (0,1]. Let V G R dxp be any matrix with orthonormal columns 
and setR = R SBHT as in eqn\Q withr = 0(pe~ 2 ■log(pd5~ 1 )-log(pe~ 2 5~ 1 log(pd5^ 1 ))). Then, 
with probability at least 1 — 5, 

||V T V- V T RR T V|| 2 < e. 

Remark. Let R be the random sign matrix described in Sectiond Then, || V T V - V T RR T V|| 2 < 
e still holds with probability at least 1 — 1/n, by setting r = O (pe~ 2 log p log a) . The proof 
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of this result is essentially the same, using Theorem 3. 1 (i) of (Magen & Zouzias, (2011) 
|14j). A similar result can be proven for the construction of (Clarkson & Woodruff, 2012 
[I]) by setting r = O (pe~ 4 log (p/e) (p + log (1/e))) . We include these lemmas in the sup- 
plementary material. 

Theorem 1. Let e £ (0, i] be an accuracy parameter, R £ ~R dxr be any matrix for which 
||V T V — V T RR r V|| 2 < e, and let X = XR. Let 7* and 7* be the margins obtained by 
solving the SVM problems using data X and X respectively (eqns. (0|) and ^)). Then, 
I* 2 > (1-e) -7* 2 . 

In words, Theorem [T] argues that for suitably large r (linear in the rank of X up to 
logarithmic factors), the margin is preserved. Theorem [1] will follow from the technical 
result that ||V T V — V T RR T V|| 2 < e holds with high probability depending on the choice 
of the random projection matrix. 

Proof: (of Theorem CP Let E = V T V - V T RR T V, and a* = [a\, a* 2 , . . . , a* n ] T £ R n be 
the vector achieving the optimal solution for the problem of eqn. ([2]) in Section [TJ Then, 



J opt 



1 

a * - -a* T YXX T Ya* 

i=i 

n 

a- - -q* t YUSV t VSU t Yq* 

i=l 
n 

JJa* - -a* T YUSV T RR r VSU T Ya* 



2 

i=l 

-^a* T YUSESU T Ya*. (5) 

Let a* = [a?, 5^, . . . , Q^] T £ IR n be the vector achieving the optimal solution for the 
dimensionally-reduced SVM problem of eqn. ([3]) using X = XR. Using the SVD of X, we 

get 

n 

Z opt = ^Ja* - -a* r YUSV T RR T VSU T Ya*. (6) 

i=i 2 

Since the constraints on ct*, a* do not depend on the data (see eqns. ([2]) and ©), it is clear 
that a* is a feasible solution for the problem of eqn. ([2]). Thus, from the optimality of a* , 
and using eqn. ©, it follows that 



Z opt = V^ a * _ la* T YUSV r RR r VSU r YQ* 

i=l 

- -a* T YUSESU T Ya* 
2 

n 

> ^ 5$ - -Q* T YUSV T RR T VSU r YQ* 



i=l 

- -a* T YUSESU T Ya* 

2 

Z opi - iQ* T YUSESU T Yd*. (7) 
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We now analyze the second term using standard sub-multiplicativity properties and V T V 
I. Taking Q = ||d* T YU5]|| 2 

id* r YU£E£U T Yd* < i ||Q|| 2 ||£|| 2 ||Q T || 2 

= 2 ll E "2 HQH2 

= - IIEIL ||d* T YU£V T ||^ 

2 11 112 II II2 



-||E|| 2 ||«* J YX|| 2 . (8) 



Combining eqns. (J7|) and (jSJ), we get 



Z opi > Z opt -^\\E\\ 2 ||a* T YX|| 2 . (9) 

We now proceed to bound the second term in the right-hand side of the above equation. 
Towards that end, we bound the difference: 

|d* T YXRR T X T Yd* - d* T YXX T Yd* | 
= I d* T YUS ( V T RR T V - V T V) SU T Yd* | 
= I a* T YUS (-E) £U r Yd* \ 
< ||E|| 2 ||d* r YU£|| 2 
= ||E|| 2 ||d* r YUSV T || 2 
= ||E|| 2 ||d* r YX|| 2 . 



We can rewrite the above inequality as 
thus, 



|q* t YXR|| 2 - ||d* T YX|| 2 



< ||E|| 2 ||d* T YX|| 2 ; 



||d* T YX||' < L —- ||d* T YXR||' . 

II II2- x _ || E || 2 II II2 

Combining with eqn. ([9]), we get 

z opt > z opt - 1 II«* Tyx R||2 • ( 10 ) 

Now recall from our discussion in Section [T] that w* T = a'^YX, w* T = d* T YXR, 
Il w *ll2 = Sr=i a I' an d [| vir* [| 2 = Y17=i®i- Then, the optimal solutions Z opt and Z opt 



can be expressed as follows: 

Z11 A 11 ~* 11 *ii^ 

opt = ||W || 2 - - ||W || 2 = - ||W || 2 , (11) 



1 11 *m2 1 11 *||2 



Combining eqns. ffTOj). (fTTj) . and (|12j) . we get 



Z 1 1 ~ * 1 1 2 m~*ii2 m~*ii2 

opt = ||w || 2 - -||w || 2 = -||w || 2 . (12) 



\E\\ 



I *l|2 \ II ~ *||2 / ^ 2 \ II - *||2 
W L > W L — ,, , W 11 



2 1 * v 1 1- \\E\\J 1 l! - 

1-^112 \ 11.,,- 1 2 



\E\\2 



w%. (13) 
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Let 7* = Hw*!!^ 1 be the geometric margin of the problem of eqn. ([2]) and let 7* = 
||w*||2 1 be the geometric margin of the problem of eqn. ([3|). Then, the above equation 
implies: 



7* 2 < 



T 2 > 



\E\ 



\E\\ 



\E\ 



\E\\ 



-,*2 



*2 



7 



(14) 



Our second theorem argues that the radius of the minimum ball enclosing all projected 
points (the rows of the matrix XR) is very close to the radius of the minimum ball enclosing 
all original points (the rows of the matrix X). 

Theorem 2. Let e £ (0, ±] be an accuracy parameter and consider the SVM formulations 
of eqns. (0|) and (Ej), let B be the radius of the minimum ball enclosing all points in the full- 
dimensional space, and let B be the radius of the ball enclosing all points in the projected 
subspace. For R as in Theorem^ with probability at least 1 — 5, B 2 < (1 + e)B 2 . 

Proof: (of Theorem^) We consider the matrix Xb G M(™ +1 ) xd whose first n rows are the 
rows of X and whose last row is the vector x^; here xg denotes the center of the minimum 
radius ball enclosing all n points. Then, the SVD of is equal to X# = UbSbV^, 
where U B G R( n+1 ) x /> s , S B G R pBXpB , and V G R dxpB . Here p B is the rank of the matrix 
Xe and clearly p B < p + 1. (Recall that p is the rank of the matrix X.) Let B be the 
radius of the minimal radius ball enclosing all n points in the original space. Then, for any 
i = l, n, 



B 2 > 



X; 



II 2 

x bII 2 



(e, - e n+ i) T X 



B 



(15) 



Now consider the matrix X^R and notice that 



< 



— e n +i) T XB 



(e, - e n+1 ) X#R 



(e* - e n+1 ) J (X B Xg - X B RR T X£) (e, - e n+1 ) 

(ej - e n+ i) T U b S b E b SbUb (e, - e n+1 

■ 2 

E_b|| 2 (e^ - e n+ i) J U^Sb 

T 



IE 



Bib 



[ ei - e n+ iY U B £ B Vj, 
2 



(e, 



X 



B 



In the above, we let E B G RPb*pb be the matrix that satisfies V^V B = V^RR T V B + E B , 
and we also used V^Vs = I. Now consider the ball whose center is the (n + l)-st row of 
the matrix X#R (essentially, the projection of the center of the minimal radius enclosing 



ball for the original points). Let i = argmax 



i=l...n 



(ej - e n+ i) T X B R 



; then, using the 
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above bound and eqn. (fT5|) . we get 

{ej - e n+ i) T X B R 

< (1+ ||E B || 2 )£ 2 . 

Thus, there exists a ball centered at e^ +1 XgR (the projected center of the minimal radius 
ball in the original space) with radius at most yl + [JEbU^B that encloses all the projected 
points. Recall that B is defined as the radius of the minimal radius ball that encloses all 
points in projected subspace; clearly, 

B 2 < (1+ \\B B \\ 2 )B 2 . 

Setting || || 2 to e concludes the proof of Theorem [2j 

o 

4 Experiments 

We describe experimental evaluations on two real-world datasets, namely a collection of 
document-term matrices (the TechTC-300 dataset (Davidov et al., 2004 [IS]) and a popu- 
lation genetics dataset (the joint Human Genome Diversity Panel or HGDP (Li et al., 2008 
|16j ) and the HapMap Phase 3 data ( Paschou et al., 2010 [17] ) and also on three synthetic 
datasets. The synthetic datasets and the TechTC-300 dataset correspond to binary classifi- 
cation tasks and the joint HapMap-HGDP dataset correspond to a multi-class classification 
task, and our algorithms perform well here as well. 

In our experimental evaluations, we implemented random projections using three dif- 
ferent methods: RS, FHT, and CW (see Section [2] for definitions) in MATLAB version 
7.13.0.564 (R2011b). We ran the algorithms using the same values of r (the dimension of 
the projected feature space) for all algorithms, but we varied r across different datasets. 
We used LIBSVM (Chang & Lin, 2011 [IB]) as our linear SVM solver with default settings. 
In all cases, we ran our experiments on the original full data (referred to as "full" in the 
results), as well as on the projected data. We partitioned the data randomly for ten- fold 
cross-validation in order to estimate out-of-sample error. We repeated this partitioning ten 
times to get ten ten-fold cross-validation experiments. In order to estimate the effect of 
the randomness in the construction of the random projection matrices, we repeated our 
cross-validation experiments ten times using ten different random projection matrices for 
all datasets. We report in-sample error (ej n ), out-of-sample error (e ou t), the time to com- 
pute random projections (t rp ), the total time needed to both compute random projections 
and run SVMs on the lower-dimensional problem (trun 

) , and the margin (7) . All results are 
averaged over the ten cross-validation experiments and the ten choices of random projection 
matrices. For each of the aforementioned quantities, we report both its mean value fi and 
its standard deviation a. For the multi-class experiment of Section H~3"l we do not report a 
margin. 

4.1 Synthetic datasets 

The synthetic datasets are separable by construction. More specifically, we first constructed 
a weight vector w G M. d , whose entries were selected in i.i.d. trials from a Gaussian distri- 
bution Af(fJ,, a) of mean \x and standard-deviation a. We experimented with the following 



< 



:i+ iiebI 



e- 



e n+l. 



X 



B 
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Table 1: Synthetic data: e out decreases as a function of r in all three families of matrices, using 
any of the three random projection methods, /i and a indicate the mean and the standard deviation 
of e ou t over ten matrices in each family Dl, D2, and D3, ten ten-fold cross-validation experiments, 
and ten choices of random projection matrices for the three methods that we investigated (a total 
of 1,000 experiments for each family of matrices). 



tout 


Projected Dimension r 






256 


512 


1024 


full 




cw 0) 


24.08 


19.45 


16.66 


15.10 


Dl 


to 


4.52 


4.15 


3.52 


2.60 




RS (/i) 


24.1.0 


19.46 


16.36 


15.10 




to 


4.45 


3.79 


3.22 


2.60 




FHT ( M ) 


23.52 


19.59 


16.67 


15.10 




to 


4.21 


4.05 


3.37 


2.60 




CW (//) 


25.94 


21.07 


17.33 


15.44 


D2 


to 


4.13 


4.16 


3.45 


2.54 




RS (/x) 


25.80 


20.80 


17.47 


15.44 




to 


4.40 


3.93 


3.42 


2.54 




FHT ( M ) 


25.33 


21.23 


17.58 


15.44 




to 


3.69 


4.24 


3.53 


2.54 




CW (/*) 


27.62 


22.97 


18.93 


15.83 


D3 


to 


3.46 


3.22 


3.32 


2.00 




RS (» 


28.15 


23.00 


18.72 


15.83 




to 


3.02 


3.48 


2.78 


2.00 




FHT ( M ) 


27.92 


23.41 


18.73 


15.83 




to 


3.46 


3.60 


3.02 


2.00 



three distributions: A/"(0, 1), A/"(l, 1.5), and M{2,2). Then, we normalized w to create 
w = w/ ||w|| 2 . Let Xjj = A/"(0, 1); then, we set x, to be equal to the i-th row of X, 
while y, = sign (w T Xj). We generated families of matrices of different dimensions. More 
specifically, family Dl contained matrices in ]j 200x5 > 000 ; family D2 contained matrices in 
]g>250x 10,000. anc i f am iiy £>3 contained matrices in M 300x 20 > 000 . We generated ten datasets for 
each of the families Dl, D2, and D3, and we report average results over the ten datasets. 
We set r to 256, 512, and 1024 and set C to 1,000 in LIBSVM for all the experiments. 
Tables Q] and [2] show e ou t and 7 for the three datasets Dl, D2, and D3. ei n is zero for 
all three data families. As expected, e ou t and 7 improve as r grows for all three random 
projection methods. Also, the time needed to compute random projections is very small 
compared to the time needed to run SVMs on the projected data. Figure [T] shows the 
combined running time of random projections and SVMs, which is nearly the same for all 
three random projection methods. It is obvious that this combined running time is much 
smaller that the time needed to run SVMs on the full dataset (with out any dimensionality 
reduction). For instance, for r = 1024, t run for Dl, D2, and D3 is (respectively) 6, 9, and 
25 times smaller than t run on the full-data. 

4.2 The TechTC-300 dataset 

For our first real dataset, we use the TechTC-300 data, consisting of a family of 295 
document-term data matrices. The TechTC-300 dataset comes from the Open Directory 
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Table 2: Synthetic data: 7 increases as a function of r in all three families of matrices. See the 
caption of Table 1 for an explanation of \x and a. 



7 


Projected Dimension r 






256 


512 


1024 


full 




CW {n) 


5.72 


6.67 


7.16 


7.74 


Dl 


(o-) 


0.58 


0.58 


0.59 


0.59 




RS (/x) 


5.73 


6.66 


7.18 


7.74 




(oj 


0.57 


0.55 


0.55 


0.59 




FHT ( M ) 


5.76 


6.64 


7.15 


7.74 






0.56 


0.58 


0.56 


0.59 




CW frt) 


6.62 


8.09 


8.88 


9.78 


D2 




0.64 


0.62 


0.59 


0.66 




RS (/x) 


6.65 


8.10 


8.88 


9.78 






0.64 


0.60 


0.63 


0.66 




FHT ( M ) 


6.66 


8.06 


8.84 


9.78 






0.63 


0.65 


0.63 


0.66 




CW (tt) 


7.69 


9.84 


11.07 


12.46 


D3 


(*) 


0.67 


0.60 


0.71 


0.69 




RS (At) 


7.61 


9.85 


11.05 


12.46 




(*) 


0.59 


0.6212 


0.62 


0.69 




FHT (a*) 


7.63 


9.83 


11.11 


12.46 






0.67 


0.64 


0.64 


0.69 



Project (ODP), which is a large, comprehensive directory of the web, maintained by vol- 
unteer editors. Each matrix in the TechTC-300 dataset contains a pair of categories from 
the ODP. Each category corresponds to a label, and thus the resulting classification task 
is binary. The documents that are collected from the union of all the subcategories within 
each category are represented in the bag-of-words model, with the words constituting the 
features of the data (Davidov et al., 2004 [15]). Each data matrix consists of 150-280 doc- 
uments (the rows of the data matrix X), and each document is described with respect to 
10,000-40,000 words (features, columns of the matrix X). Thus, TechTC-300 provides a 
diverse collection of data sets for a systematic study of the performance of the SVM on 
the projected versus full data. We set the parameter C to 500 in LIBSVM for all 295 
document-term matrices and set r to 128, 256, and 512. We use a lower value of C than for 
the other data sets for computational reasons: larger C is less efficient. We note that our 
classification accuracy is slightly worse (on the full data) than the accuracy presented in 
Section 4.4 of (Davidov et al., 2004 [E]), because we did not fine-tune the SVM parameters 
as they did, since that is not the focus of this study. For every dataset and every value 
of r we tried, the in-sample error on the projected data matched the in-sample error on 
the full data. We thus focus on e ou t, the margin 7, the time needed to compute random 
projections t rp , and the total running time t run . We report our results averaged over 295 
data matrices. Table [3] shows the behavior of these parameters for different choices of r. 
As expected, e ou t and the margin 7 improve as r increases, and they are nearly identical 
for all three random projection methods. The time needed to compute random projections 
is smallest for CW, followed by RS and FHT. As a matter of fact, t rp for CW is ten to 20 
times faster than RS and FHT for different values of r. This is predicted by the theory in 
(Clarkson & Woodruff, 2012 [4]), since CW is optimized to take advantage of input sparsity. 
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D3 

Figure 1: Total (average) running times, in seconds, of random projections and SVMs on the 
lower-dimensional data for each of the three families of synthetic data. Vertical bars indicate the, 
relatively small, standard deviation (see the caption of Table 1). 
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Table 3: Results on the Techtc300 dataset, averaged over 295 data matrices using three different 
random projection methods. The table shows how e out , 7, t rp (in seconds), and t run (in seconds) 
depend on r. /1 and a indicate the mean and the standard deviation of each quantity over 295 
matrices, ten ten-fold cross-validation experiments, and ten choices of random projection matrices 
for the three methods that we investigated. 





Projected Dimension r 






128 


256 


512 


full 




CW(/i) 


24.63 


22.84 


21.26 


17.35 


Cout 




10.57 


10.37 


10.17 


9.45 




RS(/i) 


24.58 


22.90 


21.38 


17.35 




(*) 


10.57 


10.39 


10.23 


9.45 




FHT ( M ) 


24.63 


22.93 


21.35 


17.35 






10.66 


10.39 


10.2 


9.45 




CW (/i) 


1.66 


1.88 


1.99 


2.09 


7 




3.68 


3.79 


3.92 


4.00 




RS (/*) 


1.66 


1.88 


1.99 


2.09 






3.65 


3.80 


3.91 


4.00 




FHT ( M ) 


1.66 


1.88 


1.98 


2.09 




to 


3.65 


3.81 


3.88 


4.00 




CW (/i) 


0.0046 


0.0059 


0.0075 






V) 


0.0019 


0.0026 


0.0033 






RS (//) 


0.0429 


0.0855 


0.1719 








0.0178 


0.0356 


0.072 






FHT (ja) 


0.0443 


0.0882 


0.1764 






to 


0.0206 


0.0413 


0.0825 






CW (/i) 


1.23 


2.22 


4.63 


4.85 


trim 


to 


0.87 


0.93 


1.93 


2.12 




RS (/i) 


0.99 


1.53 


3.02 


4.85 




to 


0.97 


0.59 


1.12 


2.12 




FHT ( M ) 


0.95 


1.46 


2.83 


4.85 




to 


0.96 


0.55 


1.02 


2.12 



However, this advantage is lost when SVMs are applied on the dimensionally-reduced data. 
Indeed, the combined running time t run is fastest for FHT, followed by RS and CW. In all 
cases, the total running time is smaller than the SVM running time on full dataset. For 
example, in the case of FHT, setting r = 512 achieves a running time t run which is about 
70% faster than running SVMs on the full dataset; e ou t increases by less than 4%. 

4.3 The HapMap-HGDP dataset 

Predicting ancestry of individuals using a set of genetic markers is a well-studied classifi- 
cation problem. We use a population genetics dataset from the Human Genome Diversity 
Panel (HGDP) and the HapMap Phase 3 dataset (see (Paschou et al., 2010 [IT]) for details), 
in order to classify individuals into broad geographic regions, as well as into (finer-scale) 
populations. We study a total of 2,250 individuals from approximately 50 populations and 
five broad geographic regions. The features in this dataset correspond to 492, 516 Single 
Nucleotide Polymorphisms (SNPs), which are well-known biallelic loci of genetic variation 
across the human genome. Each entry in the resulting 2,250 x 492,516 matrix is set to 
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-FHT 



256 



512 1024 
Projections 

Regional classification 



2048 











— cw 

RS 

FHT 

























o 20 

LU 

15 
10 
5 



256 512 1024 2048 

Projections 

Population-level classification 

Figure 2: e out as a function of r in the Hapmap-HGDP dataset for three different random projection 
methods and two different classification tasks. Vertical bars indicate the standard-deviation over 
the ten ten- fold cross-validation experiments and the ten choices of the random projection matrices 
for each of the three methods. 



+ 1 (homozygotic in one allele), —1 (homozygotic in the other allele), or (heterozygotic) , 
depending on the genotype of the respective SNP for a particular sample. Missing entries 
were filled in with —1, +1, or 0, with probability 1/3. Each sample has a known population 
and region of origin, which constitute its label. We set r to 256, 512, 1024, and 2048 in 
our experiments. Since this task is a multi-class classification problem, we used LIBSVM's 
one-against-one technique for classification. We ran two sets of experiments: in the first set, 
the classification problem is to assign samples to broad regions of origin, while in the second 
experiment, our goal is to classify samples into (fine-scale) populations. We set C to 1,000 
in LIBSVM for all the experiments. The in-sample error is zero in all cases. Figure [2] shows 
the out-of-sample error for regions and populations classification, which are nearly identical 
for all three random projection methods. For regional classification, we estimated e ou t to 
be close to 2%, and for population-level classification, e ou t is close to 20%. This experiment 
strongly supports the computational benefits of our methods in terms of main memory. X 
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120r 



— cw 

RS 

100- FHT 
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J3- 60- 



40- 



20- 




256 



512 1024 
Projections 



2048 



Time needed to compute random projections 

Figure 3: Total running time in seconds (random projections and SVM classification on the 
dimensionally-reduced data) for Hapmap-HGDP dataset for three different projection methods us- 
ing both regional and population-level labels. Notice that the time needed to compute random 
projection is independent of the classification labels. Vertical bars indicate standard-deviation, as 
in Figure [2] 



is 2,250 x 492,516, which is too large to fit into memory in order to run SVMs. Figure [3] 
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shows that the combined running time for three different random projection methods are 
nearly identical for both regions and population classification tasks. However, the time 
needed to compute the random projections is different from one method to the next. FHT 
is fastest, followed by RS and CW. In this particular case, the input matrix in quite dense, 
and CW seems to be outperformed by the other two methods. 

5 Conclusions and open problems 

We present theoretical and empirical results indicating that random projections are a useful 
dimensionality reduction technique for SVM classification problems that handle sparse or 
dense data in high-dimensional feature spaces. Our theory predicts that the dimensionality 
of the projected space (denoted by r) has to grow essentially linearly (up to logarithmic fac- 
tors) in p (the rank of the data matrix) in order to achieve relative error approximations to 
the margin and the radius of the minimum ball enclosing the data. Such relative-error ap- 
proximations imply excellent generalization performance. However, our experiments show 
that considerably smaller values for r (e.g., in the case of the TechTC data, setting r to 
1/70-th of all available features) results in classification that is essentially as accurate as 
running SVMs on all available features, despite the fact that the matrices have full numer- 
ical rank. This seems to imply that our theoretical results can be improved. FHT and 
RS work well on dense data while CW is an excellent choice for sparse data, as indicated 
by our experiments. However, this solid performance of CW (which is predicted by the 
theoretical bounds of (Clarkson & Woodruff, 2012 [4j)) comes at a cost, at least according 
to our experimental evaluation: solving the SVM optimization problem on the resulting 
low-dimensional dataset is quite expensive, and, as a result, the total running time of the 
CW method is eventually higher than that of FHT and RS. This seems to indicate that 
more research is necessary in terms of random projection methods that are both fast (e.g., 
can be applied on the input matrix in time that is proportional to the number of non-zero 
entries in the matrix), but also result in low-dimensional data matrices that are "friendly" 
(e.g., correspond to well-structured problem instances) for SVM solvers. Understanding 
this aspect of random projection matrices is important and it has not been investigated at 
all in existing literature. 
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Appendix 

5.1 Proof of lemmas used in Theorem 1 and 2 

Proof, (of Lemma 1) Consider the matrix V r R = V T DHS. Using Lemma 3 of [19], 

2pln(40dp) 



HDVU 



('•) 



< 



d 



(21n(40dp)) _1 



(HDV) K 



(0 



2 



1 

< - 



p d 

holds for all % = 1, . .. ,d with probability at least 1 — 5. In the above, the notation A(j) 
denotes the i-ih row of A as a row vector. Applying Theorem 4 with f3 = (2 In (40dp)) 1 
( [E]> Appendix) concludes the lemma. □ 

Lemma 2. Fix e E (0, ^] and let V £ M. dxp be any matrix with stable rank at most p. Let 
R be a d x p random sign matrix rescaled by 1/y/i. If r = O (pe~ 2 log p log d) , then with 
probability at least 1 — 1/n, 

||V T V - V T RR T V|| 2 < e. 
Proof. The proof of this result is essentially the same, using Theorem 3.1 (i) of [2]. □ 

Lemma 3. Let e £ (0, 1) and R be the random projection matrix constructed as described 
in of dimensions d x r , with r = O (/oe -4 log (p/e) {p + log (1/e))) , and assume that the 
following is true : For any vector x, 

(1 - e) ||Vz||2 < ||-R T ^||2 < C 1 + e ) \\ Vx \\l ■ 
Then, ||V T V - V T RR T V|| 2 < e. 
Proof. The assumption, 

(1 - e) ||Vx||2 < ||R T Vx||2 < (1 + e) ||Vx||| 

is equivalent to 

(1 - e) x T V T Vx < x T V T RR T Vx < (1 + e) x T V T Vx, 
which in turn is equivalent to (focus on the second inequality), 

x T (V T RR T V - V r V) x < ex T V T Vx. 
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Apply this for x = y where y is the eigenvector of (V T RR T V — V T V) that corresponds 
to the largest eigenvalue of (V T RR T V - V T V) So, 

X max (V T RR T V - V T V) < ey T y. 

Notice that y T has unit norm, so X m ax (V T RR T V — V T V) < e. 
Now since (V T RR T V — V T V) is symmetric, so 

Xmax (V T RR T V - V T V) = ||V T V - V T RR T V|| 2 < e. 

□ 
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